1 Stable Performance for cc-NUMA Using First-Touch Page Placement and Reactive Proxies
Abstract
A key problem for shared-memory systems is unpredictable performance. A critical influence on performance is page placement: a poor choice of home node can severely degrade application performance because of the increased latency of accessing remote rather than local data. Two simple page placement policies are "first-touch" and "round-robin", but neither suits all applications. We examine the advantages of each strategy, the problems that can result from a poor choice of placement policy, and how these problems can be alleviated by using proxies. Proxies route remote read requests via intermediate nodes, where combining is used to reduce contention at the home node. Our simulation results indicate that using reactive proxies with first-touch page placement always gives better performance than using either page placement policy without proxies. These results suggest that the application programmer can obtain stable performance without knowing the underlying implementation of cc-NUMA, and can avoid time-consuming performance tuning.

1.1 INTRODUCTION

Unpredictable performance anomalies have hampered the acceptance of cache-coherent non-uniform memory access (cc-NUMA) shared-memory architectures. One source of performance problems is the location of shared data: each page of shared data is allocated in distributed memory at a home node, by the operating system, when it is first accessed. The home node for a page is commonly chosen on a first-touch or round-robin basis, but neither of these policies is suited to all applications, and a poor choice of page placement policy can have a marked effect on the performance of an application. In this paper we examine the effects of simple page placement, and describe how a technique for reducing read contention in cc-NUMA machines can alleviate problems with inappropriate page placement.
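To make the trade-off between the two placement policies concrete, the following is a minimal sketch, not the paper's simulator. It assumes a toy access trace of `(node, page)` pairs and counts how many accesses are served by a remote home node under each policy. The function names, the trace format, and the two example workloads are all hypothetical.

```python
from collections import Counter

def first_touch_home(page, first_toucher, num_nodes):
    """First-touch: the page's home is the node that accesses it first."""
    return first_toucher

def round_robin_home(page, first_toucher, num_nodes):
    """Round-robin: pages are dealt out to nodes in turn (assumed: by page number)."""
    return page % num_nodes

def remote_load(accesses, policy, num_nodes):
    """Count remote accesses served by each home node for a (node, page) trace."""
    homes = {}
    load = Counter()
    for node, page in accesses:
        if page not in homes:
            # The page is placed at the moment of its first access.
            homes[page] = policy(page, node, num_nodes)
        if homes[page] != node:
            load[homes[page]] += 1
    return load

def remote_fraction(accesses, policy, num_nodes):
    """Fraction of all accesses that go to a remote home node."""
    return sum(remote_load(accesses, policy, num_nodes).values()) / len(accesses)

# Hypothetical workload A: each of 4 nodes works only on its own 4 pages.
private = [(n, n * 4 + p) for n in range(4) for p in range(4) for _ in range(3)]

# Hypothetical workload B: node 0 initialises every page before the others use them.
master_init = [(0, page) for page in range(16)] + private

for name, trace in [("private", private), ("master_init", master_init)]:
    for policy in (first_touch_home, round_robin_home):
        print(name, policy.__name__,
              remote_fraction(trace, policy, 4),
              dict(remote_load(trace, policy, 4)))
```

In this toy model first-touch is ideal when each node touches its own data first, but when a single node initialises everything, every page is homed there and that node serves all the remote traffic. Round-robin spreads the load evenly in both cases at the cost of more remote accesses in the first.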
In the proxy protocol, we associate with each location in shared memory a small set of nodes that act as intermediaries for remote read requests. Using execution-driven simulations, we show that the reactive variant of proxies, in conjunction with first-touch page placement, always yields better performance than either of the simple page placement strategies without proxies. This suggests that, by using first-touch page placement with reactive proxies, application programmers can be confident that they will obtain stable performance. The rest of the paper is structured as follows: page placement is discussed in Section 1.2, and the proxy protocol is explained in Section 1.3. We describe our simulated architecture and experimental …
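The idea of routing remote reads through a small set of proxy nodes, with combining at each proxy, can be sketched as follows. This is an illustrative model only: the mapping in `choose_proxy`, the assumption that all reads arrive simultaneously so that each proxy forwards exactly one combined request, and every name below are hypothetical and are not taken from the paper. The reactive variant is not modelled.

```python
def choose_proxy(requester, block_addr, num_nodes, num_proxies):
    """Hypothetical mapping: split nodes into num_proxies equal clusters and pick
    one node per cluster, rotated by block address to spread the proxy work."""
    cluster_size = num_nodes // num_proxies  # assumes num_nodes % num_proxies == 0
    cluster = requester // cluster_size
    return cluster * cluster_size + block_addr % cluster_size

def requests_at_home(readers, block_addr, home, num_nodes, num_proxies=None):
    """Count read requests that reach the home node for one block.

    Without proxies every remote reader contacts the home directly. With proxies,
    simultaneous reads are assumed to combine perfectly at each proxy, so each
    proxy forwards a single request on behalf of its whole cluster.
    """
    remote_readers = [r for r in readers if r != home]
    if num_proxies is None:
        return len(remote_readers)
    forwarding_proxies = set()
    direct = 0
    for r in remote_readers:
        proxy = choose_proxy(r, block_addr, num_nodes, num_proxies)
        if proxy == home:
            direct += 1  # the home is this cluster's proxy, so nothing is combined
        else:
            forwarding_proxies.add(proxy)
    return direct + len(forwarding_proxies)

# 16 nodes all read block 5, whose home is node 0.
print(requests_at_home(range(16), 5, 0, 16))                 # no proxies
print(requests_at_home(range(16), 5, 0, 16, num_proxies=4))  # four proxy clusters
```

Under these assumptions the home node sees one request per proxy rather than one per reader, which is the contention reduction the text attributes to combining.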
Similar resources
1 Stable Performance for cc-NUMA Using First-Touch Page Placement and Reactive Proxies
A key problem for shared-memory systems is unpredictable performance. A critical influence on performance is page placement: a poor choice of home node can severely degrade application performance because of the increased latency of accessing remote rather than local data. Two approaches to page placement are the simple policies "first-touch" and "round-robin", but neither of these policies suits...
Data Distribution, Migration and Replication on a cc-NUMA Architecture
It is well known that, although cc-NUMA architectures allow construction of large scale shared memory systems, they are more difficult to program effectively because data locality is an important consideration. Support for specifying data distribution in OpenMP has been the subject of much debate [1], [4], and several proposed implementations. These take the form of data distribution directives...
Geographical Locality and Dynamic Data Migration for OpenMP Implementations of Adaptive PDE Solvers
On cc-NUMA multi-processors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement. The solver is parallelized using OpenMP and the adaptive mesh refinement makes dynamic load balancing n...
Proceedings of the 6th European Workshop on OpenMP EWOMP 2004
To achieve close to optimal performance on cc-NUMA systems for shared memory parallel applications with complex data access patterns, a mechanism for co-locating threads and the data during the execution of the program is needed. The affinity-on-next-touch procedure studied in this paper is based on re-doing the standard first-touch allocation at explicitly given locations in the code. We study...
Cyclin B degradation leads to NuMA release from dynein/dynactin and from spindle poles.
The protein NuMA localizes to mitotic spindle poles where it contributes to the organization of microtubules. In this study, we demonstrate that NuMA loses its stable association with the spindle poles after anaphase onset. Using extracts from Xenopus laevis eggs, we show that NuMA is dephosphorylated in anaphase and released from dynein and dynactin. In the presence of a nondegradable form of ...